import csv
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from importlib import reload
from matplotlib import rcParams
Write a short (200 words max) summary of what the report contains and what you have discovered.
This report deals with the 'Health, Nutrition and Population Statistics' dataset. The original data has not been altered, but it has been cleaned by removing null values, which both simplified the analysis and helped produce reliable, accurate results. In this report I first go over the steps taken to clean the data, followed by some exploratory data analysis where I describe the dataset from a different angle than in the cleaning section. I explain how the dataset contains many unique indicator codes (each referring to an issue investigated in the dataset) that can in fact be grouped into 22 distinct topics. I then discuss the number of null values and how it changed through the years. To conclude the EDA section, I explore the countries with the least information available and why that might be. Finally, the Descriptive Analytics section dives into the urban population around the globe: how it is changing, why that might be, and whether a country's income group and its urban population are related. The final topic explored is the age dependency ratio, where I discovered that 'less developed' countries have a higher ratio than 'more developed' ones.
Write a short (200 words max) introduction to the dataset, emphasising why you chose it and what you expect to discover (what you expect to discover does not have to be what you then eventually discover)
For this task I chose the "Health, Nutrition and Population Statistics" dataset so I could expand my knowledge of this topic, since I am personally quite interested in maintaining a balanced lifestyle by mixing sport with a varied diet. This links well with the dataset, which discusses the distribution of health and nutrition through population statistics, covering topics such as diabetes, death rates and tobacco use. Many other topics are also present, which pique my interest since I am very interested in how different locations behave. These include information about urbanisation, the labor force and economic indicators, among many others. I expect to discover how health is distributed around the world and to spot trends showing whether some regions are generally healthier than others. My guess is that countries with a weaker labor force or less education will usually be less healthy than those with a strong labor force. Furthermore, I expect to find a strong correlation between tobacco use and death rates, and to learn which countries have higher tobacco use and why.
In this section you should demonstrate how you have programmatically cleaned and prepared the dataset you have chosen. Explain the steps you take.
First, to inspect the data, I used pandas to read the csv file. There was no need to specify the delimiter, since pandas uses a comma by default, which is exactly what this dataset uses. I dropped the country name to make the information less cluttered: it is not needed because the country code identifies each country (the matching name/code pairs are available in another csv file). A column called 'Unnamed: 66' (it originally had no name in the csv file) also had to be dropped since it contained no data. I then erased every row that had a null value (no recorded data) in any of the year columns, as such rows could not be analysed across all dates. This reduced the data from 117838 rows to 33417, just under a third of the original size. I could instead have kept the rows with only some null values and deleted entries selectively. However, since the dataset is large and the number of rows remaining after deleting every entry with any null value is still high, I believe the analysis is not greatly affected by this reduction. Unfortunately, it does mean that some topics such as tobacco use (mentioned in the introduction) cannot be analysed, as their records contained null values. On the other hand, the remaining data should be very accurate and can be analysed year by year.
StatsData = pd.read_csv('HNP_StatsData.csv')
StatsData.head()
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Unnamed: 66 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Africa Eastern and Southern | AFE | Adolescent fertility rate (births per 1,000 wo... | SP.ADO.TFRT | 140.120535 | 140.594564 | 141.128109 | 141.32344 | 141.50816 | 141.668896 | ... | 103.391461 | 101.541782 | 99.719799 | 97.916601 | 96.146492 | 94.695223 | 93.231944 | 91.744712 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | Adults (ages 15+) and children (0-14 years) li... | SH.HIV.TOTL | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Africa Eastern and Southern | AFE | Adults (ages 15+) and children (ages 0-14) new... | SH.HIV.INCD.TL | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Africa Eastern and Southern | AFE | Adults (ages 15+) living with HIV | SH.DYN.AIDS | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Africa Eastern and Southern | AFE | Adults (ages 15-49) newly infected with HIV | SH.HIV.INCD | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 67 columns
display(len(StatsData))
dropStatsData = ['Country Name','Unnamed: 66']
StatsData.drop(dropStatsData, inplace = True, axis =1)
dropNullStatsData = StatsData.dropna(subset=[str(year) for year in range(1960, 2022)])
display(len(dropNullStatsData))
117838
33417
display(dropNullStatsData)
| Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | AFE | Age dependency ratio (% of working-age populat... | SP.POP.DPND | 8.805569e+01 | 8.900535e+01 | 8.975139e+01 | 9.030093e+01 | 9.063809e+01 | 9.076959e+01 | 9.151469e+01 | ... | 8.693799e+01 | 8.632619e+01 | 8.564162e+01 | 8.490120e+01 | 8.423692e+01 | 8.349099e+01 | 8.267285e+01 | 8.178835e+01 | 8.084699e+01 | 7.998637e+01 |
| 8 | AFE | Age dependency ratio, old | SP.POP.DPND.OL | 5.805951e+00 | 5.841215e+00 | 5.854827e+00 | 5.850969e+00 | 5.833747e+00 | 5.805963e+00 | 5.836685e+00 | ... | 5.590019e+00 | 5.595846e+00 | 5.599317e+00 | 5.602250e+00 | 5.633472e+00 | 5.659007e+00 | 5.681027e+00 | 5.701516e+00 | 5.722296e+00 | 5.761120e+00 |
| 9 | AFE | Age dependency ratio, young | SP.POP.DPND.YG | 8.204644e+01 | 8.294391e+01 | 8.365833e+01 | 8.419015e+01 | 8.452215e+01 | 8.466339e+01 | 8.537528e+01 | ... | 8.024580e+01 | 7.964750e+01 | 7.898173e+01 | 7.826501e+01 | 7.758868e+01 | 7.684273e+01 | 7.603135e+01 | 7.515494e+01 | 7.422026e+01 | 7.334226e+01 |
| 238 | AFE | Population ages 00-04, female | SP.POP.0004.FE | 1.172137e+07 | 1.206910e+07 | 1.240773e+07 | 1.274886e+07 | 1.309814e+07 | 1.346958e+07 | 1.391105e+07 | ... | 4.461912e+07 | 4.541208e+07 | 4.617811e+07 | 4.696195e+07 | 4.773928e+07 | 4.848257e+07 | 4.921263e+07 | 4.994605e+07 | 5.069779e+07 | 5.146499e+07 |
| 239 | AFE | Population ages 00-04, female (% of female pop... | SP.POP.0004.FE.5Y | 1.772258e+01 | 1.779968e+01 | 1.784093e+01 | 1.786610e+01 | 1.788585e+01 | 1.792016e+01 | 1.803018e+01 | ... | 1.615277e+01 | 1.600224e+01 | 1.584081e+01 | 1.568515e+01 | 1.552739e+01 | 1.535932e+01 | 1.518890e+01 | 1.502204e+01 | 1.486343e+01 | 1.471212e+01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 117724 | ZWE | Population, total | SP.POP.TOTL | 3.776679e+06 | 3.905038e+06 | 4.039209e+06 | 4.178726e+06 | 4.322854e+06 | 4.471178e+06 | 4.623340e+06 | ... | 1.311515e+07 | 1.335038e+07 | 1.358671e+07 | 1.381464e+07 | 1.403034e+07 | 1.423660e+07 | 1.443881e+07 | 1.464547e+07 | 1.486293e+07 | 1.509217e+07 |
| 117785 | ZWE | Rural population | SP.RUR.TOTL | 3.300515e+06 | 3.404373e+06 | 3.510800e+06 | 3.611339e+06 | 3.713677e+06 | 3.817492e+06 | 3.922210e+06 | ... | 8.808921e+06 | 8.990946e+06 | 9.170486e+06 | 9.340770e+06 | 9.499100e+06 | 9.647147e+06 | 9.788215e+06 | 9.928166e+06 | 1.007082e+07 | 1.021695e+07 |
| 117786 | ZWE | Rural population (% of total population) | SP.RUR.TOTL.ZS | 8.739200e+01 | 8.717900e+01 | 8.691800e+01 | 8.642200e+01 | 8.590800e+01 | 8.538000e+01 | 8.483500e+01 | ... | 6.716600e+01 | 6.734600e+01 | 6.749600e+01 | 6.761500e+01 | 6.770400e+01 | 6.776300e+01 | 6.779100e+01 | 6.779000e+01 | 6.775800e+01 | 6.769700e+01 |
| 117827 | ZWE | Urban population | SP.URB.TOTL | 4.761640e+05 | 5.006650e+05 | 5.284090e+05 | 5.673870e+05 | 6.091770e+05 | 6.536860e+05 | 7.011300e+05 | ... | 4.306228e+06 | 4.359432e+06 | 4.416224e+06 | 4.473872e+06 | 4.531238e+06 | 4.589452e+06 | 4.650597e+06 | 4.717307e+06 | 4.792105e+06 | 4.875224e+06 |
| 117828 | ZWE | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 1.260800e+01 | 1.282100e+01 | 1.308200e+01 | 1.357800e+01 | 1.409200e+01 | 1.462000e+01 | 1.516500e+01 | ... | 3.283400e+01 | 3.265400e+01 | 3.250400e+01 | 3.238500e+01 | 3.229600e+01 | 3.223700e+01 | 3.220900e+01 | 3.221000e+01 | 3.224200e+01 | 3.230300e+01 |
33417 rows × 65 columns
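As noted in the cleaning steps, dropping 'Country Name' loses nothing, because the name can be recovered from the country code. A minimal sketch of that lookup, using a small hypothetical stand-in for HNP_StatsCountry.csv (its real 'Country Code' and 'Short Name' columns are used later in this report):

```python
import pandas as pd

# Hypothetical stand-in for HNP_StatsCountry.csv (real column names, made-up subset)
country_lookup = pd.DataFrame({
    'Country Code': ['ZWE', 'IRL', 'MDV'],
    'Short Name': ['Zimbabwe', 'Ireland', 'Maldives'],
})

# Map a country code back to its name after 'Country Name' was dropped
code_to_name = dict(zip(country_lookup['Country Code'], country_lookup['Short Name']))
print(code_to_name['ZWE'])  # Zimbabwe
```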
For the second document, 'Short definition' had to be deleted since it had no values in it, and the same applied to 'Unit of measure', 'Base Period', 'Other notes' and the other columns listed in dropStatsSeries. I considered deleting 'Periodicity' too, since I could only see 'Annual' values, but after running the code below to find out how many distinct values there were, I found that there are actually two: 'Annual' and 'Biennial'. These values must therefore remain so as not to skew the analysis. In this second document every row contains at least one null value, so we cannot simply drop every row with a null (further cleaning is required to reduce the number of nulls instead).
The 'Limitations and exceptions' column is not needed: if we want to know the limitations of the collection method, we can generally look at the 'Aggregation method' column instead, because most limitations come from the way the data was collected rather than anything else. 'Notes from original source' is irrelevant for the type of analysis I am planning, and furthermore only 45 of the 443 entries have a value in that column.
This second dataset will be useful because it lets us take several things into consideration when analysing the first document. First, we can use 'Periodicity' to see how often the data was recorded (which may influence how the data looks). We can also use it to understand the method used when collecting the data, which, alongside the other remaining columns, can give us insights into why the data looks the way it does. Finally, the development relevance can help us understand why the data collected is important. Thus, this document ultimately tells us how the data looks, why it looks that way and why it matters.
StatsSeries = pd.read_csv('HNP_StatsSeries.csv')
StatsSeries['Periodicity'].value_counts()  # distinct periodicity values and their counts
dropStatsSeries = ['Short definition','Unit of measure','Base Period','Other notes','Related source links','Other web links','Related indicators','Unnamed: 20','Limitations and exceptions','Notes from original source']
StatsSeries.drop(dropStatsSeries, inplace = True, axis = 1)
display(len(StatsSeries))
StatsSeries.head()
443
| Series Code | Topic | Indicator Name | Long definition | Periodicity | Aggregation method | General comments | Source | Statistical concept and methodology | Development relevance | License Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HD.HCI.OVRL | Public Sector: Policy & institutions | Human capital index (HCI) (scale 0-1) | The HCI calculates the contributions of health... | NaN | NaN | NaN | World Bank staff calculations based on the met... | NaN | NaN | CC BY-4.0 |
| 1 | HD.HCI.OVRL.FE | Public Sector: Policy & institutions | Human capital index (HCI), female (scale 0-1) | The HCI calculates the contributions of health... | NaN | NaN | NaN | World Bank staff calculations based on the met... | NaN | NaN | CC BY-4.0 |
| 2 | HD.HCI.OVRL.LB | Public Sector: Policy & institutions | Human capital index (HCI), lower bound (scale ... | The HCI lower bound reflects uncertainty in th... | NaN | NaN | NaN | World Bank staff calculations based on the met... | NaN | NaN | CC BY-4.0 |
| 3 | HD.HCI.OVRL.LB.FE | Public Sector: Policy & institutions | Human capital index (HCI), female, lower bound... | The HCI lower bound reflects uncertainty in th... | NaN | NaN | NaN | World Bank staff calculations based on the met... | NaN | NaN | CC BY-4.0 |
| 4 | HD.HCI.OVRL.LB.MA | Public Sector: Policy & institutions | Human capital index (HCI), male, lower bound (... | The HCI lower bound reflects uncertainty in th... | NaN | NaN | NaN | World Bank staff calculations based on the met... | NaN | NaN | CC BY-4.0 |
StatsCountry = pd.read_csv('HNP_StatsCountry.csv')
columnDrop = ['Alternative conversion factor','PPP survey year','Unnamed: 30']
StatsCountry.drop(columnDrop, inplace = True, axis = 1)
moreColumnsDrop = ['Table Name','Long Name','2-alpha code','Special Notes','WB-2 code','National accounts base year','National accounts reference year','SNA price valuation']
StatsCountry.drop(moreColumnsDrop, inplace = True, axis = 1)
StatsCountry.isnull().sum()
Country Code                                           0
Short Name                                             0
Table Name                                             0
Long Name                                              0
2-alpha code                                           2
Currency Unit                                         48
Special Notes                                        139
Region                                                48
Income Group                                          49
WB-2 code                                              1
National accounts base year                           57
National accounts reference year                     192
SNA price valuation                                   58
Lending category                                     121
Other groups                                         206
System of National Accounts                           57
Balance of Payments Manual in use                     66
External debt Reporting status                       142
System of trade                                       93
Government Accounting concept                        107
IMF data dissemination standard                       77
Latest population census                              51
Latest household survey                              111
Source of most recent Income and expenditure data     97
Vital registration complete                          145
Latest agricultural census                           137
Latest industrial data                               118
Latest trade data                                     74
dtype: int64
For StatsCountry, I started by deleting all the columns without any values in them: 'Alternative conversion factor', 'PPP survey year' and 'Unnamed: 30'. I then removed columns that were less relevant for the analysis, such as 'Long Name' or '2-alpha code', whose contents can easily be found through other tables; these were therefore not needed. To check which columns had the most nulls, .isnull().sum() was used: the higher the number displayed above, the more nulls that column contains. Note that the int64 shown is the data type of these null counts, which makes them straightforward to analyse further (plotting and so on).
StatsFootNotes = pd.read_csv('HNP_StatsFootNote.csv')
StatsFootNotes.drop(['Unnamed: 4'], inplace = True, axis = 1)
nulls = StatsFootNotes.isnull().sum().to_frame()
for index, row in nulls.iterrows():
print(index, row.iloc[0])
duplicateRows = StatsFootNotes[StatsFootNotes.duplicated()]
print('\n Duplicates found: ')
display(duplicateRows)
CountryCode    0
SeriesCode     0
Year           0
DESCRIPTION    0

 Duplicates found: 
| CountryCode | SeriesCode | Year | DESCRIPTION |
|---|---|---|---|
For the StatsFootNotes dataset, the column 'Unnamed: 4' had to be deleted. Using a for loop we can print how many null values each column had after that removal. Since no column contained any nulls, the dataset was already clean. The duplicateRows code shows that there are no duplicates in this dataset (the same check was run for all the other csv files and no duplicates were found, so drop_duplicates was not needed).
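Since the same duplicate check was run on every file, it can be wrapped in a small helper. A sketch with a hypothetical frame standing in for one of the HNP csv files:

```python
import pandas as pd

def report_duplicates(df, name):
    """Return df's duplicated rows, printing a one-line summary."""
    dupes = df[df.duplicated()]
    print(f'{name}: {len(dupes)} duplicate rows')
    return dupes

# Hypothetical frame standing in for one of the HNP csv files
demo = pd.DataFrame({'CountryCode': ['AFE', 'AFE', 'ZWE'],
                     'Year': [2020, 2020, 2020],
                     'DESCRIPTION': ['a', 'a', 'b']})
dupes = report_duplicates(demo, 'demo')  # 1 duplicate row (the repeated AFE entry)
```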
StatsCountrySeries = pd.read_csv('HNP_StatsCountry-Series.csv')
StatsCountrySeries.drop(['Unnamed: 3'], inplace = True, axis = 1)
duplicateRows = StatsCountrySeries[StatsCountrySeries.duplicated()]
print('\n Duplicates found: ')
display(duplicateRows)
Duplicates found:
| CountryCode | SeriesCode | DESCRIPTION |
|---|---|---|
Very little cleaning was needed for StatsCountrySeries: I only had to remove the extra generated column, 'Unnamed: 3'. No duplicates were found.
StatsSeriesTime = pd.read_csv('HNP_StatsSeries-Time.csv')
StatsSeriesTime.drop(['Unnamed: 3'], inplace = True, axis = 1)
duplicateRows = StatsSeriesTime[StatsSeriesTime.duplicated()]
print('\n Duplicates found: ')
display(duplicateRows)
Duplicates found:
| SeriesCode | Year | DESCRIPTION |
|---|---|---|
Same as with the previous document: the extra generated column had to be deleted, and no duplicates were found. This concludes the data cleaning, since all five documents provided have now been cleaned and we can proceed with the analysis.
Programmatically explore the dataset. Tell the reader how the dataset looks and give at least three of the most interesting observations that you can learn from the data.
To explore a dataset we can call describe(), which displays some descriptive statistics about it. When we call StatsSeries.describe(), the count, number of unique values, top value and frequency are listed for every column that has not been deleted from the dataset. We can see that 419 of the 443 entries have a topic describing what the recorded data means. The remaining 24 entries without a topic have been left in the dataset, since every entry has an indicator name (a Topic could be added manually if needed by looking at the indicator name). Interestingly, these 419 topic values span only 22 unique topics, meaning there are just 22 distinct topics we can analyse. This is far fewer than expected, given that each file contains hundreds or thousands of rows of data. Looking at the 'top' row, the most common topic is 'Population dynamics: Population by age group', which essentially describes the demographics of the population: it breaks the population down by age and sex. It could be interesting to investigate why this is the most common topic and what causes that. This can later be applied to our StatsData file to see what the values are for each country/region.
StatsSeries.describe()
| Series Code | Topic | Indicator Name | Long definition | Periodicity | Aggregation method | General comments | Source | Statistical concept and methodology | Development relevance | License Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 443 | 419 | 443 | 443 | 433 | 307 | 164 | 441 | 192 | 173 | 416 |
| unique | 443 | 22 | 443 | 335 | 2 | 7 | 68 | 63 | 77 | 47 | 1 |
| top | HD.HCI.OVRL | Population dynamics: Population by age group | Human capital index (HCI) (scale 0-1) | Age population, male refers to male population... | Annual | Weighted Average | Aggregate data for LIC, UMC, LMC, HIC are comp... | United Nations Population Division. World Popu... | Out-of-pocket payments are those made by peopl... | Strengthening health financing is one objectiv... | CC BY-4.0 |
| freq | 1 | 138 | 1 | 26 | 430 | 113 | 18 | 56 | 16 | 19 | 416 |
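The 22-topic observation can also be confirmed directly with nunique() rather than read off describe(). A sketch on a small hypothetical slice of the Topic column (the real column has 419 non-null entries):

```python
import pandas as pd

# Hypothetical slice of StatsSeries['Topic'], including a missing value
topics = pd.Series(['Population dynamics: Population by age group',
                    'Population dynamics: Population by age group',
                    'Public Sector: Policy & institutions',
                    None])

print(topics.nunique())                # number of distinct non-null topics
print(topics.value_counts().idxmax())  # the most common topic
```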
toDrop = ['Country Code','Indicator Name','Indicator Code']
StatsData.drop(toDrop, inplace = True, axis = 1)
#StatsData.count()
plt.rcParams["figure.figsize"] = (40,10) #So that the years can be readable and are not cluttered
plt.plot(StatsData.count())
plt.title('Information available through the years', fontsize = 20)
Text(0.5, 1.0, 'Information available through the years')
The second interesting thing about the dataset concerns the null values. The graph plots, for each year on the x-axis, the number of non-null values recorded in that year's column. It is interesting, though not surprising, that as the years go by we have more data and fewer nulls per year. The count increases steadily over roughly the first half of the period, then makes a big jump, followed by an even bigger jump in the amount of data. However, there is a crash close to 2021. I assume the dataset was published during 2021, leaving too little time to collect all of that year's data, hence the large drop at the end. Another possible reason for the decline (starting in 2019) is COVID-19: during the pandemic the world largely froze for several months, with only activities essential for survival taking place, so data collection was probably suspended for that period. That would describe what happens in the graph very well, and personally I believe COVID-19 was the cause of this fall.
countStatsData = dropNullStatsData.groupby(['Country Code']).count().sort_values('Indicator Name')
countStatsData = countStatsData.reset_index(level=0)
countStatsData
| Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MAF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | XKX | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | CYM | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | ... | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 3 | FRO | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | ... | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| 4 | BMU | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | ... | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 256 | IDN | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | ... | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| 257 | IND | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | ... | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| 258 | IRL | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | ... | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| 259 | MDV | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | ... | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| 260 | ZWE | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | ... | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
261 rows × 65 columns
For the third interesting observation, I decided to display the cleaned StatsData document, but instead of showing the values I displayed their counts, sorted by 'Indicator Name'. This yields an increasing table starting with MAF, XKX, CYM, FRO and BMU as country codes. These first five countries have values of 1 or 5 in the 'Indicator Name' column, meaning they only have 1 or 5 fully complete rows of data available for analysis. The five codes correspond to St. Martin (French part), Kosovo, the Cayman Islands, the Faroe Islands and Bermuda respectively, making them the countries with the least complete data in our dataset after cleaning. Most of these are small islands that belong to larger nations but are geographically distant from them. It is interesting that these territories, despite being partly managed by other nations, still do not have the same amount of information available. This may be because they are quite remote and harder to reach. The only exception is Kosovo; there, one might argue, the lack of available information is explained by the ongoing conflict with neighbouring countries.
My initial idea was to show these countries on a world map, but while doing this I realised that some of these countries/territories are too small to be displayed on a map, forcing me to choose a different display method.
Instead, to put this into perspective, I decided to use a bar chart showing the five countries with the least data against the country with the most (several countries shared the maximum of 150 complete rows; Ireland was selected for comparison purposes).
countStatsData.drop(countStatsData.columns.difference(['Country Code','Indicator Name']), axis=1, inplace=True)
countStatsData.query('(index < 5) or (index == 258)', inplace = True)
countStatsData
plt.figure(figsize=(9, 3))
plt.subplot(131)
#Code to produce the bar graph
plt.title('Amount of information in countries')
plt.xlabel('Countries')
plt.ylabel('Complete rows of information')
plt.bar(countStatsData['Country Code'],countStatsData['Indicator Name'])
<BarContainer object of 6 artists>
As the bar chart above shows, the amount of data available for these five countries is almost nothing compared with the amount available for Ireland. Bear in mind that there were countries with even less information, but they were removed entirely during data cleaning, making these five the countries with the least complete data remaining.
Interrogate the dataset with descriptive analytics. Answer 3 questions that you find interesting, and that can be answered using descriptive analytics. Make each question a subsection of this section and follow the steps: 1. the question, 2. justify why it is interesting, 3. the answer.
Initially I was planning to find out which countries had the highest tobacco use, since it is a very important topic. I intended to discover whether countries in a lower income group (which I assumed would have less available information) would have higher tobacco use. However, after the data cleaning I realised that this topic did not have enough data to proceed with the analysis.
It is no lie that when we picture 'developed countries', we imagine a futuristic, cramped city with tall buildings and a well-functioning economy. When we think of a 'less developed country', we picture quite the opposite: large amounts of open space with houses and farms spread out across the land. As humans and as a society develop, one would imagine that we would increasingly live in those cities. Since 'developed countries' tend to have a larger share of urban population, it seems the only way to move towards development would be to increase this percentage. But how true is that?
# .copy() avoids SettingWithCopyWarning when dropping columns from a .loc slice;
# dropping 'Country Code' too leaves only the numeric year columns for .mean()
urbanPopulation = dropNullStatsData.loc[dropNullStatsData['Indicator Code'] == 'SP.URB.TOTL.IN.ZS'].copy()
drop = ['Country Code','Indicator Name','Indicator Code']
urbanPopulation.drop(drop, inplace = True, axis = 1)
df1 = urbanPopulation.mean()
ruralPopulation = dropNullStatsData.loc[dropNullStatsData['Indicator Code'] == 'SP.RUR.TOTL.ZS'].copy()
ruralPopulation.drop(drop, inplace = True, axis = 1)
df2 = ruralPopulation.mean()
plt = reload(plt)  # reset matplotlib to its defaults after the earlier figsize change
#Code to produce the line graph
plt.plot(df1, label = "Urban Population")
plt.plot(df2, label = 'Rural Population')
plt.legend()
plt.title('Urban Population through the Years', fontsize = 20)
plt.ylabel('% of Population', fontsize = 20)
plt.xlabel('Years', fontsize = 20)
Text(0.5, 0, 'Years')
The answer:
Looking at the data, it is undeniable that the percentage of urban population has grown roughly linearly through the years, and we can expect it to keep growing that way. As we keep evolving and developing, we see a larger share of the population living in urban areas. This might be due to the better chances of getting a good job in an urban area, or to other advantages such as more education opportunities. Comparing this with the rural population, we see the inverse: as time goes by, the rural share decreases while the urban share increases. We can therefore conclude that the rural population is being replaced by the urban population.
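One quick sanity check of this 'replacement' claim is that the urban and rural shares should mirror each other, summing to roughly 100% in every year. A sketch with hypothetical yearly means standing in for df1 and df2:

```python
import pandas as pd

# Hypothetical yearly means standing in for df1 (urban %) and df2 (rural %)
urban = pd.Series({'1960': 33.6, '1990': 43.0, '2020': 56.2})
rural = pd.Series({'1960': 66.4, '1990': 57.0, '2020': 43.8})

# If rural dwellers are being 'replaced' by urban ones, the two shares
# should sum to roughly 100% in every year (rounding absorbs float noise)
totals = (urban + rural).round(6)
print(totals.eq(100.0).all())  # True
```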
I believe there is a huge misunderstanding with this topic, since the vast majority of people think that a country with a big urban population will generally do better (in terms of economy) than one with a low percentage of urban population. However, I do not think that is entirely true, as some countries build up a great economy based on their agriculture and can proportionally do better than a very 'urban' country. Thus, I decided to analyse this further.
#StatsCountry to check against income group
StatsCountry.drop(StatsCountry.columns.difference(['Country Code','Income Group']), axis=1, inplace=True)
new_df = dropNullStatsData[dropNullStatsData['Indicator Code'] == 'SP.URB.TOTL.IN.ZS']
df3 = pd.merge(StatsCountry, new_df, how='outer')
# Select only the numeric year columns before averaging
yearColumns = [str(year) for year in range(1960, 2022)]
highIncome = df3[df3['Income Group'] == 'High income'][yearColumns].mean()
lowIncome = df3[df3['Income Group'] == 'Low income'][yearColumns].mean()
#Code to produce the line graph
plt.plot(lowIncome, label = "low income countries")
plt.plot(highIncome, label = "high income countries")
plt.title('Urban population based on Income group', fontsize = 20)
plt.ylabel('% of Population', fontsize = 20)
plt.xlabel('Years', fontsize = 20)
plt.legend()
<matplotlib.legend.Legend at 0x7f82b157dfd0>
The answer:
As the graph above clearly shows, higher income countries have, on average, a larger percentage of their population living in urban areas. Back in the 1960s, a higher income country had approximately 60% of its population living in urban areas, whereas a lower income country had close to 10%. As discovered in question 1, we tend to live in urban areas more these days. This also holds for lower income countries, so it would be no surprise if they start to 'develop' faster now that they too are making the transition. However, looking at the data for 2021, lower income countries sit at approximately 40% urban population, which is less than what higher income countries had back in the 60s, suggesting that a low income country is probably more than 60 years behind a higher income one on this measure.
The age dependency ratio compares the people who are not in the labor force with those who are. In other words, this ratio shows how the workforce supports those under 15 years old or over 64 (the dependent population). A low age dependency ratio is beneficial for a country and indicates a well-functioning economy. On the other hand, discovering which countries have a high age dependency ratio is extremely important, as other countries may be able to help if the pressure on the workforce becomes too great.
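As a rough illustration (not the exact methodology behind the dataset's indicators), the ratio described above can be sketched from hypothetical age-group population counts:

```python
# Hedged sketch: a dependency ratio computed from made-up age-group counts.
# The function name and the "per 100 workers" convention are assumptions
# for illustration, not taken from the dataset's documentation.
def age_dependency_ratio(under_15, working_15_64, over_64):
    """Dependents (under 15 or over 64) per 100 working-age people."""
    return 100 * (under_15 + over_64) / working_15_64

# Example with illustrative figures (in millions):
# 30 young, 60 working-age, 10 elderly
ratio = age_dependency_ratio(30, 60, 10)
print(round(ratio, 1))  # 66.7 dependents per 100 workers
```

A larger numerator (more young and elderly people) relative to the working-age population pushes the ratio up, which is the "stress on the workforce" discussed in the analysis below.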
# .copy() avoids the SettingWithCopyWarning when modifying the slice below
ageDependency = dropNullStatsData.loc[dropNullStatsData['Indicator Code'] == 'SP.POP.DPND.YG'].copy()
ageDependency.drop(columns=ageDependency.columns.difference(['Country Code', '2021']), inplace=True)
newStatsCountry = pd.read_csv('HNP_StatsCountry.csv')
newStatsCountry.drop(columns=newStatsCountry.columns.difference(['Country Code', 'Short Name']), inplace=True)
df3 = pd.merge(newStatsCountry, ageDependency, how='outer')
df3 = df3.dropna(axis=0, subset=['2021'])
df3 = df3.rename(columns={'2021': 'Age Dependency Ratio'})
fig = px.choropleth(df3, locations="Country Code",
                    color="Age Dependency Ratio",
                    hover_name="Short Name",  # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
StatsData2 = pd.read_csv('HNP_StatsData.csv')
StatsData2.drop(columns=['Country Name', 'Unnamed: 66'], inplace=True)
# Keep only rows that have a value for every year from 1960 to 2021
yearColumns = [str(year) for year in range(1960, 2022)]
dropNullStatsData2 = StatsData2.dropna(subset=yearColumns)
# The boolean mask must come from the same DataFrame being indexed
newAgeDependency = dropNullStatsData2.loc[dropNullStatsData2['Indicator Code'] == 'SP.POP.DPND.YG']
sns.catplot(data=newAgeDependency, kind='box', aspect=6)
Through this coloured map, we can see how the age dependency ratio is distributed around the world; I then used a boxplot to show the average trends throughout the years. Age dependency ratios are highest in central Africa, while the Americas show lower values, as do some countries in Asia and Oceania. A higher age dependency ratio suggests considerable stress on the economy: the dependent population may be too large for the workforce to support. We therefore tend to see higher values in 'less developed' countries, where the economy has to direct its resources to other pressing needs. The boxplot, however, shows that this ratio decreases over time in a fairly linear manner: not only the means but the boxes themselves move towards the origin of the y-axis, which suggests an improvement in this aspect of the economy. Since we are analysing average values across all countries, we can conclude that, in general, we are moving towards a point of less stress on the economy (at least in this sector of it), which may in turn affect other sections of the economy positively. The map, meanwhile, makes it easy to identify the countries that may require external help to solve this problem, which is why visualising the data in this manner is so useful.
Suggest further steps to enrich the dataset with additional (external) data, or apply some additional advanced modelling or analysis. You do not have to fully implement your suggestions, but you will have to explain and justify your suggestions.
Despite this dataset having multiple null values, I believe it is quite accurate when it comes to the values that have actually been collected. However, a couple of things could improve both the data and the analysis. Unfortunately, we lost many entries after cleaning the data. We cannot amend the empty values for past years, since those records were simply never collected, but if a trained machine-learning model can predict future values, it should also be able to estimate past values to some extent. Thus, a 'simulated' dataset with predicted values for both past and future years would be a great tool for further evaluating the current situation around the world. Also, making sure that more data is collected as time goes on will be helpful, as we could then analyse real data. As seen in the introduction of this report, this is already happening (more data becomes available every year).
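Before training any model, a very simple baseline for the "estimating past values" idea is to interpolate along the year axis of a single indicator row. The column names and values below are illustrative stand-ins, not taken from the real dataset:

```python
import numpy as np
import pandas as pd

# Hedged sketch: filling gaps in one indicator's time series by linear
# interpolation, then back/forward filling the edge years as a crude
# stand-in for "predicting" missing past and future values.
years = ['1960', '1961', '1962', '1963', '1964']
row = pd.Series([np.nan, 12.0, np.nan, 14.0, np.nan], index=years)

# interpolate() handles inner gaps; bfill()/ffill() handle the edges
filled = row.interpolate().bfill().ffill()
print(filled.tolist())  # [12.0, 12.0, 13.0, 14.0, 14.0]
```

A trained model could of course do much better than this linear baseline, but it gives a reference point to measure any model against.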
Expanding on the idea of artificial intelligence, we could use this very dataset to train a model in the specific areas where there is enough data. It would be very interesting to use such a model to fill in the empty values for countries with no records: by considering the characteristics of other countries with data, we could find the country most similar to the one without data and borrow its values. Since a model is capable of finding connections within a dataset, this would be extremely helpful and time efficient.
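The "most similar country" idea can be sketched without any machine-learning library at all, by comparing countries on the indicators they do share. The country codes, indicator values, and distance measure below are all made up for illustration:

```python
import numpy as np

# Hedged sketch: countries with complete data, each represented by a vector
# of (hypothetical) indicator values, e.g. life expectancy, fertility, urban %.
complete = {
    'AAA': np.array([70.0, 2.5, 60.0]),
    'BBB': np.array([55.0, 5.0, 30.0]),
}

def most_similar(partial_vector, known_mask):
    """Return the country whose known indicators are closest (Euclidean)."""
    best, best_dist = None, float('inf')
    for code, vec in complete.items():
        dist = np.linalg.norm(vec[known_mask] - partial_vector[known_mask])
        if dist < best_dist:
            best, best_dist = code, dist
    return best

# A country missing its third indicator: compare only on the known two
target = np.array([68.0, 2.8, np.nan])
mask = ~np.isnan(target)
donor = most_similar(target, mask)
print(donor)  # 'AAA' is closest on the two known indicators
```

A real implementation would need to normalise the indicators first (so that large-valued ones do not dominate the distance), but the borrowing logic would stay the same.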
We could then compare future (collected) results with those predicted by the model to see how accurate it gets. If it is not very accurate, it would only require more training on additional recorded data, ultimately making it very effective. Furthermore, once it can predict results accurately, we could use it to estimate the values for past years where there is no data (since we cannot go back in time to collect it ourselves, this is one of the few options we have). With that extra data from the past, we could then analyse trends again to see if anything new and interesting emerges, as we would now take advantage of all 22 topics in the dataset (instead of the few remaining after the data cleaning).
Through pattern recognition, a trained model could help us anticipate many of the problems we might encounter. In terms of health, nutrition and population, we might face over-population of the world at some point. If we can predict this far enough in advance, we could find a way to slow the problem down or even prevent it entirely, which shows the importance of this technology.
List all sources you have utilised in the making of this report here.